R and RStudio are free to download. Google them, download, and install them. Then open up RStudio.
First install R:IDE stands for integrated development environment. One of the best things about R is that the community has converged on using one IDE and that is RStudio. There are some other options (e.g. vim and VisualStudio) but the vast majority of folks use RStudio.
For those unfamiliar with IDEs, it will be where you write, run, and test R code. There’s a number of other features that make it convenient for development but these are the core aspects.
To create a script, File > New File > R Script. Wa-lah! You have your first R script. Pretty boring at first but it will soon be filled with objects and functions that make nice visualizations or analyses for your data.
To get in a cood habit of using the IDE, we’ll write our code in the text editor portion of the IDE and then run it down in the console. This will save you a lot of headache compared to simply writing code down in the console and running the code.
To start, let’s create an object ‘x’ which is a vector of numbers.
x <- c(4, 5, 2, 7, 8, 9, 1)
x
## [1] 4 5 2 7 8 9 1
To run this code, you’ll want to highlight the code in the text editor and then hit Cmd + E on Macs or Ctrl + E on Windows. Additionally, there is a ‘Run’ button in the top right corner of the text editor which does the same thing as these shortcuts.
Awesome! We just created an object! It seems pretty uneventful at the moment but this is a huge step towards effectively working with data in R. There’s a few things to point out here about x. First, we created it by assigning it with ‘<-’ to a vector of numbers which we’re concatenated (by c).
I called this object ‘x’. Stylistically, this is not the best name for a variable. Typically, you want to call it something more informative like ‘price’ or ‘weight’. For this example, though, we’ll just keep it as ‘x’ for simplicity. Getting into style is beyond the scope of this tutorial, if you’d like to learn more about style – Google has put together a nice guide on how to style your R code.
If you’ve worked with another programming language, you might be asking about ‘<-’. Namely, “Why didn’t you use an equal sign?” you may ask and I don’t have a great answer for that. I will say that you can use ‘=’ in place of ‘<-’ for the vast majority of your code and the code will run just the same. However, they are different in their scope, as shown in these examples here.
One other practice that is important for coding is documentation. Documentation allows you to help you plan your code, keep track of your thoughts and rationale, and allows you to easily share your code with others. One form (and the most important form) of documentation is called commenting. Commenting can be done in R using the ‘#’ function. If put a ‘#’ in your code, R will not execute this line as code. Instead, R will ignore it.
# 'x' is a vector of values
x <- c(4, 5, 2, 7, 8, 9, 1) # R will run the x <- c(4, ..., 1) code, but it won't run the text after '#'
# show x
x
## [1] 4 5 2 7 8 9 1
In R, you can interact with your objects in a variety of ways. Let’s first bring back our object ‘x’.
x <- c(4, 5, 2, 7, 8, 9, 1)
x
## [1] 4 5 2 7 8 9 1
As you can see, the number 4 is in the 1st position of x, 5 is in the 2nd position of x, 2 is in the 3rd position of x, and so on. We can first detemine how many numbers are in object ‘x’ using the length() function. The eponymous length() function returns the the length of vectors (Hint: Type help(length) for a more information).
length(x)
## [1] 7
When we execute the length function, we find that object ‘x’ has a length of 7. That is, object ‘x’ contains 7 values.
Now let’s say we wanted to get one value of object ‘x’. We can do that by indexing into ‘x’. For this example, let’s grab the 5th position of ‘x’. We know that the 5th position of x is 8. We can accomplish this in R using the square bracket operator: ‘[]’.
x[5]
## [1] 8
To clarify, we called the object ‘x’ and then used the ‘[]’ and an index number ‘5’ to get the value 8. You’ll find that other languages often begin indexing at 0 instead of 1 (e.g. C, Python, and Javascript).
We can also change, or reassign, specific elements of ‘x’ using the ‘[]’ operator. Let’s say we wanted to change the 5th position of x from 8 to 3. We can do this by calling the object ‘x’ and use the ‘[5]’ operator to index into the 5th position of ‘x’. We then use the assign function ‘<-’ followed by ‘3’ to change the 5th element of ‘x’ from 8 to 3.
x[5] <- 3
x
## [1] 4 5 2 7 3 9 1
By calling ‘x’, we can see that we successfully changed the 5th position of ‘x’ from 8 to 3.
Now let’s say you wanted to remove a specific position from object ‘x’. For instance, let’s remove the 3rd position of ‘x’, which should be 2.
x[3]
## [1] 2
To remove this value, we do something slightly different. Instead of using just ‘3’ in the ‘[]’ operator, we will put ‘-3’ in the ‘[]’.
# Remove a value by using the '-' operator
x[-3]
## [1] 4 5 7 3 9 1
However, if we call ‘x’ again, we find that the 3rd position is still there.
x
## [1] 4 5 2 7 3 9 1
To save this change, we have to reassign ‘x’ to ‘x[-3]’.
# Reassign 'x' to save the updated version of 'x' that removes the value in the 3rd position
x <- x[-3]
x
## [1] 4 5 7 3 9 1
One last basic operation is appending or adding an element to the end of ‘x’. As you’ll find in programming, there are multiple solutions to an answer. Some are a little more efficient than others, but we will worry about that later. We will show you 2 of how to append or add to your ‘x’ object.
The first way is familiar since it is how we first created the ‘x’ object. We will use the c() or the combine fuction.
# Recreate 'x'
x <- c(4, 5, 2, 7, 8, 9, 1)
x
## [1] 4 5 2 7 8 9 1
# Append the number 4 to 'x' using the 'c()' function
c(x, 4)
## [1] 4 5 2 7 8 9 1 4
# Save out the new 'x' using the '<-' assign function
x <- c(x, 4)
x
## [1] 4 5 2 7 8 9 1 4
When using the c() function, we put ‘x’ in as the first argument followed by a ‘,’ and then the value we want to add ‘4’. Similar to the removing a value, we must reassign ‘x’ to ‘c(x, 4)’.
We can also append to x using the append() function. In the append() function the first argument is the vector to be modified and the second argument contains the values to be included in the modified vector.
# Recreate 'x'
x <- c(4, 5, 2, 7, 8, 9, 1)
x
## [1] 4 5 2 7 8 9 1
# Append the number 4 to 'x' using the 'c()' function
append(x, 4)
## [1] 4 5 2 7 8 9 1 4
# Save out the new 'x' using the '<-' assign function
x <- append(x, 4)
x
## [1] 4 5 2 7 8 9 1 4
You can also work with a vector of characters in R. A character is anything that is not a number. A character can be created by putting a value in single quotes ’’ or double quotes “”.
x <- 4
x
## [1] 4
y <- '4'
y
## [1] "4"
In this example, we created two variables, ‘x’ and ‘y’. They look similar, however, there is a major difference between the two. ‘x’ contains the numeric value of 4, whereas ‘y’ contains the character value of 4.
# Check to see if 'x' and 'y' are numeric or character type
is.character(x)
## [1] FALSE
is.numeric(x)
## [1] TRUE
is.character(y)
## [1] TRUE
is.numeric(y)
## [1] FALSE
The output shows that ‘x’ is a numeric value and ‘y’ is a character value. We can add, subtract, multiply, or divide ‘x’ by any value since it is numeric.
x
## [1] 4
x + 4
## [1] 8
x - 2
## [1] 2
x*5
## [1] 20
x/6
## [1] 0.6666667
However, if we try to add, subtract, multiply or divide ‘y’ by a value, it will result in an error.
Despite this problem, we can create a vector of strings and operate on them as we did above with the ‘x’ vector.
# Recreate 'x'
x <- c(4, 5, 2, 7, 8, 9, 1)
x
## [1] 4 5 2 7 8 9 1
y <- c('hi', 'hello', 'howdy')
y
## [1] "hi" "hello" "howdy"
We can isolate a value in ‘y’ just like we did in ‘x’ using the ‘[]’ operator. Let’s say we want to grab the 3rd position of y.
y
## [1] "hi" "hello" "howdy"
y[3]
## [1] "howdy"
We can remove a specific value from ‘y’using the’-‘sign in the’[]’ operator. For instance, let’s remove the 1st position of y.
y[1]
## [1] "hi"
y[-1]
## [1] "hello" "howdy"
y <- y[-1]
y
## [1] "hello" "howdy"
We can also append to the ‘y’ vector using the c() function.
y
## [1] "hello" "howdy"
y <- c(y, 'goodbye', 'ciao')
y
## [1] "hello" "howdy" "goodbye" "ciao"
y <- c('first', 'second', y)
y
## [1] "first" "second" "hello" "howdy" "goodbye" "ciao"
As you can see in the example above, you can add elements to ‘y’ before or after ‘y’. This depends on where you put ‘y’ when you call the c() function.
Dataframes are the bread and butter of working with data in R. There’s a universe of tools for working with this kind of data (i.e. see the tidyverse). When developing in R, you typically want to be working with dataframes because of their flexibility. They’re so important, I’m going to dedicate the whole next section to working with data frames.
A dataframe is an object that stores data in rows in columns. A column can be many different classes (i.e. one column can be a character where another is an integer).
x <- c(4, 5, 2, 7, 8, 9, 1)
y <- c("a", "a", "a", "b", "b", "b", "b")
mydata <- data.frame(x, y)
Where before we had single objects (i.e. x and y) that we indexed into, now we have an object that has both x and y. For those familiar with Excel, this can be thought of as two columns in a spreadsheet.
Now that you have the data in this new object that contains both our and x and y vectors, how do we do some operations on it? As with many tasks in R, there’s a number of ways to index data in a dataframe. The most common way is to use a ‘$’ sign to reference a column.
mydata$x
## [1] 4 5 2 7 8 9 1
Neat. Let’s say you wanted to access the third entry in this column. We can use the same indexing rules introduced earlier.
mydata$x[3]
## [1] 2
As an aside, if you’ve worked with indexing blocks of data before, you know that you can often just index both the rows and columns (e.g. mydata[row,column]). This is totally valid but can be a little unclear when trying to read through the code (i.e. what is that column??). As such, we recommend using the ‘$’ sign.
mydata[,1]
## [1] 4 5 2 7 8 9 1
mydata[3,1]
## [1] 2
This is a point of frustration for many developers getting started with R (many experienced developers as well). What are you talking about you ask? It’s called coercion. Let me illustrate:
class(y)
## [1] "character"
Now let’s take a look at that same y variable in our dataframe.
class(mydata$y)
## [1] "factor"
Wait. We never changed y to a factor. Wait. What just happened? It seems minor but trust me you will lose hours of your life because of this coercion. Trust me. You will be there. I promise. The best workaround is to simply specify that you would not like this coercion to happen.
mydata <- data.frame(x, y,
stringsAsFactors = FALSE)
class(mydata$y)
## [1] "character"
This seems minor and it is. Until you forget about it.
Like Python and MATLAB, R has basic plotting functions. You can create a number of different plots: line charts(plot()), histograms (hist()), bar charts (barplot()), and box plots (boxplots()). Here’s a nice quick resource for basic plots in R Quick-R Graphs Section.
We can create a histogram of the data using the hist() function
hist_data <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4)
hist_data
## [1] 1 1 1 2 2 3 3 3 3 4
hist(hist_data)
Let’s explore a simple scatter plot using the ‘x’ object we created before using the plot() function with ‘x’ as an arugment.
x <- c(4, 5, 2, 7, 8, 9, 1)
x
## [1] 4 5 2 7 8 9 1
# Call the plot function to plot x
plot(x)
Now we have a simple scatter plot of the ‘x’ object. The y-axis label is ‘x’ and the x-axis label is ‘index’. We can change these properties within the function call of plot. This really isn’t that informative so let’s make a new object that’s a function of x and plot it.
y <- 5*x
plot(x, y)
Sweet! We made a scatterplot in about 3 lines of code.
For loops are perfect for executing the same code for as many times as specified. They take the general form as below:
for (i in 1:100) {
# Do something i times
}
Here the for loop will go through 100 times (starting at 1 and going until i = 100). I end up using for loops for a number of things. When reading data in from a directory, I’ve used for loops to read in all the ‘i’ files in from the directory. I have also used it for simulating data sets (e.g. MCMC or bootstrap simulations).
Let’s say I wanted to get the sampling distribution of the mean of an exponential distribution. I could do that with a for loop.
expBoot <- NULL
# Generate 100 random numbers from a an exponential distribution with beta = 3
expSample <- rexp(100,3)
for (i in 1:1000) {
# Take the mean of a random sample of 100 from the random numbers
expBoot[i] <- mean(sample(expSample, 100))
}
This is pretty great. This is, in effect, simulating central limit theorem in about 5 or 6 lines of code.
for loops can be great for a lot of things, efficient code is not one of them (especially when working with larger datasets). To optimize efficiency, you’ll want to use something called vectorized operations or use a package that does vectorized operations for you to speed up processing.